By Julien Hernandez Lallement, 2020-12-02, in category Tutorial
Since my transition from academia to the private sector, I have realized that one major challenge businesses face is the scalability of the processes they develop. In other words, building a process (be it a simple data flow, a dashboard or a more complex ML pipeline) is one thing, but deploying that process so that execution and monitoring are scheduled, automatic and easy to debug is another.
In this post, I will present a use case from a project that required a dashboard to be updated on a daily basis, and how I used Google Cloud Platform (GCP) to scale up the project and allow automatic execution.
In this project, I was asked to provide an interactive dashboard that monitored a series of KPIs and other indicators regarding digital communication in the company.
The first steps, as usual, involved locating the data, building data pipelines, and finally building the dashboard. I have another post focused on such processes, so I won't cover this aspect here.
Once the pipeline had been built, I needed to execute it every morning to feed the dashboard with fresh data. That was annoying. I needed a scalable system that executed the pipeline on a schedule and allowed users to monitor whether it had run correctly.
GCP is a terrific tool that is constantly gaining improvements and new features. I learned how to use GCP on my own, so do not take my approach as best practice. There might be other functionalities I am unaware of that would allow you to run this process more efficiently.
The first step was to export the pipeline to GCP in order to run it in the cloud.
I first created a Virtual Machine instance on GCP, under AI Platform / Notebooks. This section provides JupyterLab interfaces that allow you to explore data in a notebook framework.
Upon creation of the instance, you can START the virtual machine. Note that you will get billed for running instances, so make sure you STOP them when done.
PATH = "C:\\Users\\jhernandez-lallement\\Documents\\GCP_post\\"
Image(filename = PATH + "Capture1.PNG", width=800, height=500)
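As a side note, the same START/STOP actions can also be triggered from a terminal with gcloud; the instance name and zone below are placeholders for your own values:
# start the notebook VM (replace the instance name and zone with your own)
gcloud compute instances start my-notebook-instance --zone=europe-west1-b
# stop it again once you are done, to avoid being billed for idle time
gcloud compute instances stop my-notebook-instance --zone=europe-west1-b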
Once in the JupyterLab interface, you can open a terminal and git clone your repository. I am assuming that the reader knows Git and has their library pushed to some kind of online repository.
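For reference, the clone is a plain git command run from the JupyterLab terminal; the URL below is a placeholder for your own repository:
git clone https://github.com/your-account/your-repository.git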
Image(filename = PATH + "GitClone.PNG", width=800, height=500)
Once your repo is cloned, the folder containing your library is copied onto your VM; in my case, it contained several files & folders, as displayed below:
Image(filename = PATH + "Repo.PNG", width=350, height=200)
Now that your code is present in the cloud, you can run all your computations on the VM and make sure that everything works as expected. There might be some issues if your code was written under a different OS, so it is always a good idea to rerun the code here and make sure the output is the one you need.
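In my case, that sanity check boiled down to installing the dependencies and running the pipeline script once from the terminal. The commands below are only a sketch; the file names come from the Dockerfile shown further down, and the exact paths depend on your repository layout:
# from the root of the cloned repository
pip install -r requirements.txt
# run the pipeline once manually (path assumed from the repo layout)
python src/jhl001_digital_communication/run_pipe_all.py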
Docker is kind of a lightweight virtual machine of its own, which contains all the components you need to run an application. While I initially found it quite complicated to understand, you can see it as a bubble, a snapshot of an application that allows you & others to re-run your code somewhere else under the exact same conditions. You can find more information on Docker here; for now I will assume you know the basics of using Docker.
Since you know the basics ;), you know that you should use a Dockerfile to specify how to build and run your container. The Dockerfile is a text file, named exactly that, which contains all the commands required to build and run your application.
You can create a Dockerfile in your folder root and add commands as required by the project. In my case, it required the following:
FROM python:3.8
COPY ./src/jhl001_digital_communication ./
COPY ./requirements.txt ./requirements.txt
COPY ./dat_ini/mycreds.txt ./mycreds.txt
COPY ./dat_ini/client_secrets.json ./client_secrets.json
RUN pip install -r requirements.txt
ENTRYPOINT ["python", "run_pipe_all.py"]
The file run_pipe_all.py contains the pipeline I needed to execute, all in one file. As a reminder, this pipeline fetched data from different locations, munged it, and exported it to another location to feed a Tableau dashboard.
The Dockerfile looked like this, with a few lines commented out for files I did not need after testing:
Image(filename = PATH + "DockerFile.PNG", width=350, height=200)
Nice! So everything is ready to build your container, which you can do easily by running the following line in the terminal:
docker build . -t gcr.io/PROJECT_NAME/CONTAINER_NAME:latest
In my case, I am using a test project (you can find your project ID in the project section of GCP) and I named the container dashboard. The :latest tag simply marks this build as the latest version of the image.
Image(filename = PATH + "container_build.PNG", width=500, height=500)
Now you have containerized your application / code. Running the container should produce the same result as running the pipeline directly on your VM, where all dependencies have already been installed in your Python environment. You can test this by typing this command in the terminal:
docker run -ti gcr.io/PROJECT_NAME/CONTAINER_NAME:latest
Image(filename = PATH + "container_run.PNG", width=500, height=500)
In my case, the Python file is executed, which runs the whole pipeline that I had put together in one single file.
Good! Your code is now packaged, and you can use this container to share it with others, or simply execute it easily through this interface.
Now, let's see if we can use other GCP features to automate this code and execute it on a regular schedule.
In order to share your container, you can push it by using the command line below:
docker push gcr.io/PROJECT_NAME/CONTAINER_NAME:latest
This will place your container in Container Registry, so you can access it from other interfaces.
Image(filename = PATH + "container_push.PNG", width=500, height=500)
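If the push is rejected with an authentication error, Docker may not yet be authorized for gcr.io; in that case, registering the gcloud credential helper should fix it (this was not an issue in my setup, so treat it as a hedged hint):
# let Docker authenticate against Container Registry via gcloud
gcloud auth configure-docker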
If you now go to Container Registry (left GCP toolbar), you should see your container in the list:
Image(filename = PATH + "Container_Registry.PNG", width=600, height=800)
You can see how my container, the latest image of it, is now present in Container Registry. FYI, a Docker image is a blueprint of what needs to be built when running the container.
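If you prefer the terminal over the console, the same check can be done with gcloud (again using the placeholder project and container names from above):
# list the images/tags pushed under your container name
gcloud container images list-tags gcr.io/PROJECT_NAME/CONTAINER_NAME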
Good, now you have your container image pushed. Let's use other GCP features to schedule its execution & monitoring!
Once your image is present in the registry, the next step is to deploy the container on a GCE VM. One neat feature of GCE is that it will automatically provide a Container-Optimized OS with Docker installed on it. This means that the container will be executed as soon as the VM is launched. Below, you see the new VM called update-dashboard that I created to host the container image.
Note how the VM dashboard-communication created under the AI Platform / Notebooks interface is also present in GCE.
Image(filename = PATH + "GCE.PNG", width=600, height=800)
To do so, select the option to deploy a container image when creating the VM and point it to the image you pushed earlier:
Image(filename = PATH + "GCE_Container.PNG", width=600, height=800)
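If you prefer the command line, the console steps above have a gcloud equivalent along these lines; the zone is a placeholder, and the image name refers to the dashboard container pushed earlier:
# create a Container-Optimized OS VM that runs the container at boot
gcloud compute instances create-with-container update-dashboard \
    --zone=europe-west1-b \
    --container-image=gcr.io/PROJECT_NAME/dashboard:latest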
Now your container is hosted in a GCE VM.
Let's now use two GCP components: Cloud Scheduler and Cloud Functions. I used this Google documentation to implement these functionalities; it is quite easy to follow.
The steps below are taken from the Google documentation, but I reproduce them here for completeness.
To create a Cloud Function to start your VM instance, set the trigger to Cloud Pub/Sub and create a new topic; a New pub/sub topic dialog box should appear. Click Save at the bottom of the Trigger box. Then, on the left side of the code editor, select index.js and paste the code below, and do the same for package.json, whose content follows the index.js snippet.
const Compute = require('@google-cloud/compute');
const compute = new Compute();
/**
 * Starts Compute Engine instances.
 *
 * Expects a PubSub message with JSON-formatted event data containing the
 * following attributes:
 *  zone - the GCP zone the instances are located in.
 *  label - the label of instances to start.
 *
 * @param {!object} event Cloud Function PubSub message event.
 * @param {!object} callback Cloud Function PubSub callback indicating
 *  completion.
 */
exports.startInstancePubSub = async (event, context, callback) => {
  try {
    const payload = _validatePayload(
      JSON.parse(Buffer.from(event.data, 'base64').toString())
    );
    const options = {filter: `labels.${payload.label}`};
    const [vms] = await compute.getVMs(options);
    await Promise.all(
      vms.map(async instance => {
        if (payload.zone === instance.zone.id) {
          const [operation] = await compute
            .zone(payload.zone)
            .vm(instance.name)
            .start();
          // Operation pending
          return operation.promise();
        }
      })
    );
    // Operation complete. Instance successfully started.
    const message = 'Successfully started instance(s)';
    console.log(message);
    callback(null, message);
  } catch (err) {
    console.log(err);
    callback(err);
  }
};

/**
 * Validates that a request payload contains the expected fields.
 *
 * @param {!object} payload the request payload to validate.
 * @return {!object} the payload object.
 */
const _validatePayload = payload => {
  if (!payload.zone) {
    throw new Error("Attribute 'zone' missing from payload");
  } else if (!payload.label) {
    throw new Error("Attribute 'label' missing from payload");
  }
  return payload;
};
{
  "name": "cloud-functions-schedule-instance",
  "version": "0.1.0",
  "private": true,
  "license": "Apache-2.0",
  "author": "Google Inc.",
  "repository": {
    "type": "git",
    "url": "https://github.com/GoogleCloudPlatform/nodejs-docs-samples.git"
  },
  "engines": {
    "node": ">=10.0.0"
  },
  "scripts": {
    "test": "mocha test/*.test.js --timeout=20000"
  },
  "devDependencies": {
    "mocha": "^8.0.0",
    "proxyquire": "^2.0.0",
    "sinon": "^9.0.0"
  },
  "dependencies": {
    "@google-cloud/compute": "^2.0.0"
  }
}
The Stop function (if you need it) is created following the same steps; check out the Google documentation for the right code snippet.
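As an aside, both functions can also be deployed from the terminal instead of the console. A hedged sketch for the start function is shown below; the trigger topic name is an assumption (use whatever topic you created for the function), and the command should be run from the folder containing index.js and package.json:
# deploy the start function, triggered by the Pub/Sub topic created above
gcloud functions deploy startInstancePubSub \
    --trigger-topic=start-VM-instance \
    --runtime=nodejs10 \
    --entry-point=startInstancePubSub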
Once you are done preparing the two functions, you should see them in the main pane:
Image(filename = PATH + "CloudFunctions.PNG", width=600, height=800)
You can test that the function correctly starts the VM by opening the function and going into Test mode. Again, follow the Google documentation, which is very informative ;)
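For such a manual test, keep in mind that the function expects a base64-encoded Pub/Sub payload containing the zone and label attributes used in index.js. A hedged example of how to produce that string in a terminal (the zone is a placeholder, and the label is the GCE label we will look up in the next step):
# base64-encode the JSON payload expected by startInstancePubSub
echo -n '{"zone":"europe-west1-b","label":"container-vm=cos-stable-85-13310-1041-24"}' | base64
The resulting string should then go into the data field of the triggering event you pass in Test mode.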
The final step is to use Cloud Scheduler to define when the VM should be started and stopped. Since the container is launched when the VM boots, starting the VM will automatically launch your code.
On the Cloud Scheduler page, create a new job, select Pub/Sub in the Target field, and point it to the topic that triggers the start function (start-VM-instance in my case). In the payload, you then need to specify the zone and the label of the VM to be started. The label can be looked up in GCE: go to the VM instance that you previously created and check the label assigned to it. In this case, it is: container-vm=cos-stable-85-13310-1041-24
Image(filename = PATH + "label_GCE.PNG", width=600, height=800)
You end up with this kind of window, which in my case defined a job to start each day at 6:35 in the morning:
Image(filename = PATH + "CScheduler_Start.PNG", width=600, height=800)
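For completeness, the same job can be created from the terminal; the sketch below assumes the topic name from above, a placeholder zone, an arbitrary job name, and the 6:35 daily schedule expressed in cron syntax:
# create a daily 6:35 job that publishes the start payload to the topic
gcloud scheduler jobs create pubsub start-dashboard-vm \
    --schedule="35 6 * * *" \
    --topic=start-VM-instance \
    --message-body='{"zone":"europe-west1-b","label":"container-vm=cos-stable-85-13310-1041-24"}'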
You can create a stop scheduler job by taking similar steps.
This brings you to two scheduled jobs:
Image(filename = PATH + "Cloud_scheduler.PNG", width=600, height=800)
You can also test that the jobs work correctly by pressing the RUN NOW button on the start job and, later, on the stop job. They should start and stop your GCE VM, respectively.
That's it! We have imported a repo into GCP, containerized it, and used GCP features to schedule its execution. Monitoring of the process can be done relatively easily by accessing the Logs window of your GCE VM:
Image(filename = PATH + "logs.PNG", width=600, height=800)
This window allows you to observe the logs of your code. As you can see from the screenshot above, the output of each print command appears on its own line, which can be useful to debug and monitor how your application was deployed and executed.
Here are some references I used when working on this project: